1.  Introduction

Work has been concentrated in three areas: system design and applications,
memory design, and transmitter design. Goals for each of these areas have been
determined, and work has progressed to provide detailed simulations of the OPTIMUL
(for Optically Interconnected Multiprocessor) system and to demonstrate its usefulness
in a number of applications. A preliminary design for the system was completed and
plans for fabrication of a prototype system were developed.

Professor Kowel attended the 1988 ACM International Conference on
Supercomputing in July, and presented the paper entitled "OPTIMUL: An Optical
Interconnect for Multiprocessor Systems", included in the Appendix. One of the major
invited talks, by Carl Ledbetter, President of ETA, dealt with the challenges of
obtaining a factor of 10 improvement in supercomputer performance. He showed that
fundamental physical constraints as well as practical fabrication problems rule out
success by traditional technological paths. He mentioned two possible paths to gain
significant improvements: software, and hybrid electronic/optical systems. The work
done during this reporting period encourages us in our belief that we have a promising
solution, based on both categories.

2.  System  Design  and  Applications 

Both an optical read and an optical read/write system have been evaluated for
their potential increase in speed in a multiprocessor environment. It may turn out that
the potential increase in speed is greatest in loosely coupled systems. Database
applications were studied extensively as an application area which can benefit greatly
from the use of optical memory interconnect as proposed in OPTIMUL. The use of the
OPTIMUL system in projection, sort and join operations was studied. Select and
project operations lend themselves easily to simple multiprocessor operations, and
optical interconnect will allow for reception of the partitioned task without contention
and subsequent delay. Sorting and joining operations have also been studied, and
algorithms utilizing optical interconnects were developed. The results of this portion of
the work have been accepted for the Eighth Annual IEEE International Phoenix
Conference on Computers and Communications, to be held in Scottsdale, AZ, in
March 1989. A copy of this manuscript is included in the Appendix.

In  addition  to  the  work  performed  in  the  area  of  relational  database  applications 
for  the  OPTIMUL  system,  the  following  tasks  were  undertaken: 

1.  Identification of a particular computational problem which will benefit
most greatly from the OPTIMUL technology. Possible problems include
pattern recognition and classification, weather prediction, or expert
system tasks.

2.  Algorithm development for solution of the identified problem using an
optically interconnected multiprocessor system.

3.  Simulation of the system using various sized memories and transfer
rates, and various system configurations.


2.  Memory  Design 

A preliminary design of an Optically Writeable RAM Cell (OWRC) has been
completed and simulations of the device performed. The device is essentially a fast
static RAM cell with the capability of being written to optically. Electronic writes may
also be performed, in which case the device acts as a standard memory device. The
circuit uses reverse-biased photodiodes as optical detectors; these detectors are
modeled as current sources which generate current in proportion to the amount of
illumination. The most important feature of this design is that the device acts as a
differential detector and can determine small differences between a reference beam
and the information beam. In this way the memory information may be transmitted to
the receiver without full modulation of the incident beam. As discussed in this report,
simulations have shown that with a modulation level of less than 1%, the memory
information can be received in 10 ns.

2.1  Basic  Operation 

The OWRC consists of three major functional blocks: 1) the input circuitry, 2) the
differential amplifier/SRAM cell, and 3) the output circuitry. It has two data inputs and
requires four controlling clocks. Two of these clocks require the inverse signal to drive
the P-channel devices. The complements can be generated by adding inverters to the
cell, but to maintain a minimum size they have been assumed to be provided. Thus
there are a total of 8 inputs (6 clock signals and 2 data). The results of circuit
simulations of the devices are shown in Figures 4 and 5.

2.2  Detailed  Operation 

The OWRC employs differential optical inputs to maximize resolution and to
minimize the constraints on the optical system which will be supplying the input
signals. The input circuit is shown in Figure 1a and the corresponding circuit model is
shown in Figure 1b. The positive and negative inputs are identical in every way. The
input circuitry drives the differential amplifier stage, which can be viewed as a large
capacitance.

The optical receivers are reverse-biased diodes, each of which will conduct a current
proportional to its illumination. Since the diodes may be under constant illumination,
the nodes IN_POS and IN_NEG will normally be at 5 volts. To operate the circuit, the
nodes POS_SAMP and NEG_SAMP must first be discharged to ground so that they will
start charging from the same level. This is done by asserting SHRT_CLK which drives
the gates of M11 and M21 to charge POS_SAMP and NEG_SAMP. The final voltages are
determined by the illumination and the duration of SAMP_CLK. With a different amount
of illumination on the two diodes (representative of a one in memory), the POS_SAMP
and NEG_SAMP nodes will charge at different rates and will thus be at different final
voltages when SAMP_CLK is negated.


2.3  Differential  Amplifier  /  Static  Ram  (SRAM)  Cell 

The differential amplifier, as illustrated in Figure 2, consists of two cross-coupled
CMOS inverters with two additional FETs to allow the application and removal
of power to the two inverters. Once the sample has been taken and SAMP_CLK has
been negated, the difference may be evaluated by asserting the evaluation clocks
EVAL_CLK and NOT_EVAL. As can be seen in Figure 2, EVAL_CLK drives an N-channel
FET which connects the inverters to ground, and NOT_EVAL drives the complementary
P-channel FET which connects the inverters to power. When power and ground are
applied to the inverters, they amplify the difference between the two sample nodes
(inputs to the inverters) and settle with the higher one at 5 volts and the lower one at
ground. It is important that the input circuitry supply current such that these nodes
charge to a level Vq which is constrained such that (Vdd - Vthp) > Vq > (Vss + Vthn).
Failure to meet this condition will result in either the P- or N-channel devices in both
inverters being off when evaluation starts, causing unpredictable results. EVAL_CLK is
held high as long as it is desired to maintain the data in the RAM cell.
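
The sample-then-evaluate sequence can be illustrated numerically. The sketch below is a behavioral model only, not the circuit simulation reported here: it assumes each photodiode is an ideal current source charging a fixed node capacitance from ground, and it treats evaluation as an ideal comparison. The threshold voltages, photocurrent, capacitance and all function names are hypothetical values chosen for illustration.

```python
VDD = 5.0    # supply voltage (V), as in the text
VSS = 0.0    # ground (V)
VTHP = 1.0   # assumed P-channel threshold magnitude (V), illustrative only
VTHN = 1.0   # assumed N-channel threshold (V), illustrative only


def sample(photocurrent_a, samp_time_s, node_cap_f):
    """Voltage on a sample node after charging from ground.

    The photodiode is modeled as a constant current source, so
    V = I * t / C, clipped at the supply voltage.
    """
    return min(VDD, photocurrent_a * samp_time_s / node_cap_f)


def in_valid_window(v):
    """True if v lies strictly between VSS + VTHN and VDD - VTHP."""
    return (VSS + VTHN) < v < (VDD - VTHP)


def evaluate(v_pos, v_neg):
    """Ideal latch: the higher node settles at VDD, the lower at VSS."""
    if v_pos == v_neg:
        raise ValueError("no difference to amplify")
    return (VDD, VSS) if v_pos > v_neg else (VSS, VDD)


def read_bit(i_ref_a, modulation, samp_time_s, node_cap_f):
    """Sample both nodes, check the voltage window, then latch the result."""
    v_neg = sample(i_ref_a, samp_time_s, node_cap_f)
    v_pos = sample(i_ref_a * (1.0 + modulation), samp_time_s, node_cap_f)
    if not (in_valid_window(v_pos) and in_valid_window(v_neg)):
        raise ValueError("sample nodes outside the valid window")
    return evaluate(v_pos, v_neg)


if __name__ == "__main__":
    # Hypothetical operating point: 1 uA reference photocurrent,
    # 1% modulation, 5 ns sample time, 2 fF node capacitance.
    print(read_bit(1e-6, 0.01, 5e-9, 2e-15))
```

In this idealized model any nonzero difference is resolved; in the real cell, capacitance mismatch and noise set the smallest modulation that can be detected.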


2.4  Output Stage

It is important that the capacitances at the nodes POS_SAMP and NEG_SAMP be
closely matched, since deviation from perfect matching will degrade the resolution of
the difference amplification. With this in mind, the output of the OWRC is also
differential, for capacitance matching purposes. The output stage consists of a simple
CMOS pass gate, which is shown in the diagram of the complete circuit (Figure 3).
Data may be read out of the RAM cell any time after it has settled by asserting
ENL_CLK and INV_ENL, thus connecting OUTPUT (INV_OUT) to POS_SAMP (NEG_SAMP).


Figure 1.  Input circuit (a) and model for photodiodes (b). Labeled nodes:
POS_SAMP, NEG_SAMP, IN_POS, IN_NEG.

Figure 2.  Differential Amplifier/SRAM Cell.

2.5  Balanced Receiver Design

During the last period of this contract, the balanced receiver was analyzed as a
possible input structure for the memory. The two photodiodes act as a differencing
element for the optical signals received, as shown in Figure 5. By using this as an input
to the memory circuit described previously, it should be possible to reduce the
complexity and thus the real estate requirements for the receiver. The use of a
coherent receiving system was also investigated, as shown in Figure 6. The
advantage of this receiving system is that it does not require polarizing filters and has
better theoretical sensitivity, but it requires more sophisticated optics to create the
interference.


Figure  5.  Balanced  receiver  for  detection  of  amplitude  modulated 
signals.  Reference  cell  transmits  the  dark  level  while  the  data  pixel 
transmits  the  memory  information. 

Figure  6.  Balanced  receiver  used  as  a  coherent  detector.  In  this  case  no 
polarizing  filters  are  necessary  and  the  differential  phase  shift  between 
the  two  pixels  is  measured.  This  configuration  has  the  highest  theoretical 
sensitivity  but  requires  vibration-free  optics  and  a  coherent  source. 




SCEEE/RADC  Final  Report 


Revised.  March,  1989 


3.  Optical  Design 

Based  on  the  preliminary  design  of  the  optical  receiving  elements,  an  optical 
budget  can  be  estimated  for  the  system.  Figure  7  illustrates  the  incident  power  on  a 
transmitting  element  and  the  subsequent  propagation  and  losses  in  a  system 
containing  8  receiving  arrays. 


Figure  7.  Optical  Schematic  for  Budget  Calculations. 

Detailed calculations based on the specifications of available CCD devices as
receivers and ferroelectric liquid crystals as the modulation coating have been made.
They reveal that 1 Watt of input power is sufficient to drive 64 processors from one
64Kb shared memory, assuming a 'no repeater' architecture. This optical budget can
certainly be provided by a modest gas laser, or by an incandescent source. For a thin
solid film coating, the estimation is more difficult. Our current azo-dye etalons provide
only 0.01% modulation, compared to nearly 100% for the liquid crystal films. Of
course, we expect to make far better etalons with better dyes as the work continues.
With an improvement by a factor of 100, we should be able to design electronics
capable of discriminating the two switched levels.

4.  Optical  Budget 

Based on the design of a multiple-image system using beam-splitters (as
shown in the previous quarterly report), the total modulated optical power needed
was calculated as a function of the switching current per bit. The total
optical input power required is found to be


$$ P_o = \frac{n \, i_s}{s \, c \, m^{n} \, D} $$

where

P_o = required optical input power
n   = number of receiving arrays
s   = scattering loss factor (assumed to be 0.5)
c   = collimating lens loss factor (assumed to be 0.5)
m   = mirror loss factor (assumed to be 0.92)
D   = receiver detectivity (assumed to be 0.3 A/W)
i_s = switching current
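
As a rough numerical illustration, the budget can be evaluated with the loss factors listed above. The sketch assumes the expression has the form Po = n * is / (s * c * m^n * D), that is, the input power is shared among n receiving arrays and the mirror loss compounds once per array. That form, the function name, and the 64K bit count used in the example are assumptions made for illustration, not values taken from the calculation reported here.

```python
def required_input_power(i_s, n, s=0.5, c=0.5, m=0.92, d=0.3):
    """Optical input power (W) needed to deliver switching current i_s (A)
    to one bit position on each of n receiving arrays.

    Assumed form: Po = n * i_s / (s * c * m**n * d), where s, c and m are
    the scattering, collimating-lens and mirror loss factors and d is the
    receiver detectivity in A/W.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return n * i_s / (s * c * (m ** n) * d)


if __name__ == "__main__":
    per_bit = required_input_power(10e-9, 32)
    print(f"per bit: {per_bit:.3e} W")
    # Hypothetical 64K-bit memory image with every bit illuminated at once.
    print(f"64K bits: {per_bit * 65536:.2f} W")
```

Under this assumed form the required power grows faster than linearly in n, because the m^n term shrinks as arrays are added.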

Figure  8  reveals  several  cases.  For  the  simulations  shown  below  the 
modulation  was  considered  to  be  100%  efficient.  For  modulation  efficiencies  less  than 
100%  the  required  optical  power  will  increase  linearly  with  the  decrease  in  modulating 
efficiency.  As  illustrated  in  the  graph,  if  switching  currents  on  the  order  of  10  nA  are 
sufficient for switching the memory bits (with a bit error rate of < 10^-11) the system will
require  less  than  10  Watts  of  optical  input  power,  even  with  32  processors.  While  it  is 
possible  to  obtain  lasers  with  this  much  continuous  power,  it  will  be  more  feasible  to 
use filtered incandescent light as a source. Filtering the light from a broadband source
will provide an inexpensive yet strong (>10 W) source of light. It is interesting to note
that the Fabry-Perot etalons only have an effective path length on the order of 500 µm
and thus the coherence length of the light needs to be on the order of 1 mm. Since
ordinary discharge lamps have coherence lengths on the order of several mm, it
should be possible to obtain a powerful yet inexpensive light source for thin film
modulators.


Figure 8.  Switching current versus optical power (horizontal axis: switching current).
5.  Summary of a Case Study: Sorting Application

Assumptions

-  loosely coupled system (ring topology)
-  OPTIMUL transmission speed: 500 ns
-  non-OPTIMUL speed: 50 Mbits/sec
-  memory transfer size: from 16K 32-bit integers to 256K 32-bit integers
-  effective OPTIMUL transmission: 1 Gbits/sec to 16 Gbits/sec
-  each processor: VAX 8600 equivalent


Results:  128K integer array

# of processors    speedup compared to conventional multiprocessor system
2                  1:1.05
8                  1:1.17
32

Multiprocessor Interconnect Simulation

Assumptions

16 processors
8 memories: each 512 words, 20 bits/word
Total memory space = 4K
O(n) problem with 84% reads and 16% writes
Adds 256 numbers: each processor adds 16 numbers
MIC-1 processors
Assume memory cycle = microcycle
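
The workload in these assumptions (256 numbers added by 16 processors, 16 numbers each) can be written out as a short sketch. It emulates the partitioning sequentially and models no memory timing or contention, so it shows only how the work is divided, not how the interconnects compare. The function names are hypothetical.

```python
def partition(values, n_procs):
    """Split values into n_procs equal contiguous blocks."""
    if len(values) % n_procs != 0:
        raise ValueError("values must divide evenly among processors")
    size = len(values) // n_procs
    return [values[i * size:(i + 1) * size] for i in range(n_procs)]


def parallel_sum(values, n_procs=16):
    """Each simulated processor sums its own block; the partial sums
    are then combined into the final total."""
    partials = [sum(block) for block in partition(values, n_procs)]
    return sum(partials), partials


if __name__ == "__main__":
    data = list(range(256))
    total, partials = parallel_sum(data)
    print(total, len(partials))
```

Each processor reads its 16 operands and writes one partial sum, which is the kind of read-dominated access pattern the assumptions describe.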

Results:

connection network    ratio to ideal    read ratio    write ratio
ideal                 1.0               1.0           1.0
single bus            2.15
crossbar              1.37
OPTIMUL               1.07              1.1           1.4

Conclusions

-  Speedup achieved is dependent on the relationship between computation
   complexity and communication complexity.

-  Some applications may show a decrease of at least an order of magnitude
   in run time using OPTIMUL technology.

-  Applications previously not feasible for multiprocessor systems could be
   run on an OPTIMUL system.

-  Ferroelectric liquid crystals will be suitable for a prototype system.

-  Modulation has been demonstrated in solid nonlinear films, and IC surface
   modulators appear feasible.

6.  Conclusions

During the contract, work has progressed in all areas of the program. By
performing simulations at both the system and circuit level, we are able to predict the
overall performance of an OPTIMUL system and to develop a preliminary design.
This design allows for implementation in the immediate future
using available materials such as fast liquid crystals, but will be applicable to other
technologies being developed, such as polymeric electro-optical thin film materials.

Concerning  the  tasks  which  lie  ahead  of  us,  the  following  should  be  mentioned 
as  the  most  crucial  ones: 

a.  Further  investigation  of  hardware/software  methods  for  interprocess  and 
interprocessor  communication. 

b.  Investigation  into  topologies  (tree,  ring,  grid,  etc.)  in  order  to  determine  how 
best  to  exploit  the  tremendous  communication  bandwidth  of  OPTIMUL. 

This program has received considerable visibility during the contract period. In
addition to the two reviewed conference papers (1988 International Conference on
Supercomputing; 1989 IEEE International Phoenix Conference on Computers and
Communications), further evidence of the value of the effort can be found in the fact
that the US Office of Trademarks and Patents has notified us of the allowance of a
patent application (Electro-Optical Interface) submitted prior to the award of this
contract.

The  calculations  included  in  this  report  demonstrate  that  OPTIMUL  can  be  a 
viable  computer  technology  and  that  research  and  development  should  be  vigorously 
pursued. 

An optical interconnect is proposed for multiprocessor systems of both the
tightly and loosely coupled types. This interconnect solves the problem of
contention for memory and interconnect in the tightly coupled case, and the
problem of network bottlenecks in the loosely coupled case.

1.  Introduction 

Multiprocessor (MP) systems consisting of p interconnected but independent
processors have the potential for a speedup factor of p in computational power.
However, a long-standing problem has been that this potential has not been
realizable, due to the interference of processor-memory and/or
processor-processor communication. This has been the case for both of the
types of MP systems [...]:

Tightly-Coupled (TC) Systems

The dominant overhead in TC systems takes the form of contention (a) for the
shared memory [...], and (b) for the processor-memory interconnect. The latter
problem is exacerbated by the fact that the expense of full crossbar switches
results in the use of other networks for which there is even more processor
contention for the interconnect [... Hwang and Briggs, 1984]. For small values
of p, use of caches can be effective, but the efficiency decreases with p
[Wilson, 1987]. Furthermore, it has recently been discovered that access to
interprocess synchronization variables in shared memory worsens this problem
tremendously [Pfister and Norton, 1986].

Permission to copy without fee all or part of this material is granted provided
that the copies are not made or distributed for direct commercial advantage,
the ACM copyright notice and the title of the publication and its date appear,
and notice is given that copying is by permission of the Association for
Computing Machinery. To copy otherwise, or to republish, requires a fee and/or
specific permission.

© 1988 ACM 0-89791-272-1/88/0007/0016 $1.50


TC systems have been used [...] in which the parallelism is fine-grained,
i.e. [...] interprocessor communication. The effects of the communications
overheads described above, though, have limited TC systems to rather small
numbers of processors.

Loosely-Coupled (LC) Systems

In LC systems, there is no shared memory, [...] still communications overhead
of another [...]: processors communicate with each other through a network.
Bandwidth limitations in this interconnect [...] network present very
substantial overhead. For example, intercluster accesses in the Cm* machine
were [...] times slower than accesses to local memory [Hwang and Briggs, 1984].
The effect of this [...] has been that LC systems have [...] coarse-grained
applications, in which [...] interprocessor communication [...].

In both the TC and LC settings, another [...] problem is the severe
restrictions resulting from pin limitations. Even channels of very high
bandwidth, such as those constructed from optical fibers, do not [...] the
problem arising from the fact that there [...] few data pins but thousands or
even millions of bits in a memory chip.

The entire history of the development of MP technology has been dominated by
the search for [...] these problems [Siewiorek et al., 1982; Hwang and Briggs,
1984; Agrawal, [...]]. Essentially, no completely satisfactory solutions have
been found. For example, after Cray Research, Inc. released the Cray X-MP, an
MP [...] the Cray-1 supercomputer, recently a number of publications
[Bailey, 1987; Cheung and Smith, [...]; [...] Lange, 1988] [...] showed the
system to [...] slowdowns due to both contention for shared memory and
contention for the network which connects the [...] to that memory, just as
with all the earlier MP [...].

Perhaps a more dramatic example is the S-1, a TC MP system developed at
Lawrence Livermore National Laboratories [Hwang and Briggs, 1984]. Throughout
the period of development of this system, it was hailed as one of the most
advanced MP projects in existence. However, recently the project was
discontinued, in spite of all the favorable publicity and the very extensive
funds expended [Bruner, 1987]. One of the primary reasons given for the
discontinuation was that the project engineers had found that the contention
for shared memory in the system would be much greater than they had
anticipated. They are now beginning work on a completely new design.

We will present here a radically new interconnect method which will solve
these problems, and have other advantages as well:

a)  The new interconnect will be usable for both fine-grained and
    coarse-grained types of applications. Moreover, it could be applied to
    build systems which are equally effective on both of these application
    types, with no reconfiguration time. Such systems would then also work
    well for "medium-grained" applications, thus recognizing that the
    fine-grained and coarse-grained concepts are merely two extremal
    representatives in a broad range of problems having varying degrees of
    frequency of interprocessor communication.

b)  It will solve the long-standing problems of contention for memory and for
    the processor-memory switch in TC systems. There will be absolutely no
    queueing delay for read access to shared memory.

c)  In LC systems, it will enable a truly dramatic improvement in
    interprocessor communications bandwidth, and again totally eliminate
    contention for the interconnect switch.

d)  Although there will still be physical limitations on the size of p, such
    limits should be far less constraining than those of existing systems with
    conventional nonoptical processor-memory interconnects.

e)  Our approach should also be superior to other optical processor-memory
    interconnects which have been proposed, e.g. optical crossbars
    [Bell, 1986; Hutcheson et al., 1987]. For example, an optical crossbar
    switch of size as large as 256 x 256 has been anticipated, but even this
    would only allow 256 simultaneous bits to be transmitted; by contrast,
    under our approach, the entire contents of a chip can be transmitted
    simultaneously, i.e. thousands or even millions of bits can be sent in
    parallel. Note that this also implies that the pin-limitation problem is
    eliminated, which is a problem even in those architectures which have
    been proposed based on an optical fiber interconnect.

Our name for this new interconnect is OPTIMUL, an acronym for Optical
Multiprocessor Interconnect. The central feature is an optical
processor-memory channel, which will allow simultaneous access of a memory
chip, where the word "simultaneous" is meant both with respect to all bits in
the chip, and with respect to all processors. In other words, all processors
can simultaneously read the entire contents of a chip with no interference at
all. Write access is of course restricted to a single processor at a time, but
it still is simultaneous across bits in the chip, i.e. an entire chip can be
written in one access. This optical channel is described in Section 2, and
then MP system architectures utilizing it will be proposed in Section 3.
Section 4 will then present some implementation details.

2.  A  New  Optical  Memory  Access  Channel 

Consider devices D_1, ... which wish to read a memory chip C, in which are
stored bits B_1, .... We will report here a technique in which the devices can
read from C optically, bypassing the need for using the chip's pins, and which
will allow this access to be simultaneous, with respect to both devices D_i
and bits B_j (Figure 1).

To achieve this, C will be coated with a thin polymeric film, using a
Langmuir-Blodgett (L-B) or other technique [Kowel et al., 1985; Kowel et al.,
1987]. When C is illuminated, e.g. by a laser, the film will cause the
reflected beam to be intensity-modulated by the electric fields at each
position beneath the film in C. Thus the reflected beam will contain a
complete bit map of the contents of C. The beam will be processed by optical
apparatus for focusing onto the receivers.

Demodulation of the beam back to storage as electric fields at the receivers
is accomplished by the use of photosensitive technology. For example, one
possibility is to use ordinary DRAM memories, which have a natural sensitivity
to light. This means also that parts of C must be masked from the light, so
that illumination of C does not change the contents of bits in C, e.g. only
the output portion of a gate can be exposed. CCD or CID arrays are also
possibilities for use as demodulators.

In this way, the values stored at all the bits B_j in C can be transmitted
optically to the devices D_i, simultaneously over all subscripts i and j.
Clearly, the simultaneity over i will have highly significant implications for
the memory and interconnect contention problems which have plagued TC systems,
while the simultaneity over both i and j will have an equally profound impact
on the network bandwidth limitation problem in LC systems. Note again that the
classical bottleneck arising from limitations on the pins-to-stored-bits ratio
is completely bypassed in the approach described here.

Writes to C can be accomplished by reversing the process, or by separate
chips, as in Section 3. E.g. suppose an entity associated with D_i wishes to
write to C. At the time the system is built, D_i would be coated with the same
type of film as described for C. Then to write to C, the following would be
done: The entity places the items to be written into D_i (in most system
architectures based on this interconnect technology, the items will already be
in D_i anyway; see Section 3). Then the illumination to C is turned off, and
D_i is illuminated instead. The effect is that the contents of D_i are copied
to C.
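
The read and write paths described in this section can be mimicked with a toy model. The sketch below is only an illustration of the data movement: it assumes that an illuminated chip presents its whole bit map at once and that every receiver latches that bit map in a single step. The class and function names are hypothetical, and no optics, masking or timing is modeled.

```python
class OpticalChip:
    """A film-coated chip whose bit map is carried by the reflected beam."""

    def __init__(self, n_bits):
        self.bits = [0] * n_bits
        self.illuminated = False

    def reflected_beam(self):
        """Return a copy of the full bit map if illuminated, else None."""
        return list(self.bits) if self.illuminated else None


def broadcast(source, receivers):
    """Copy every bit of source into every receiver in one step.

    Returns False (and changes nothing) if the source is not illuminated.
    """
    beam = source.reflected_beam()
    if beam is None:
        return False
    for receiver in receivers:
        receiver.bits = list(beam)
    return True


if __name__ == "__main__":
    c = OpticalChip(8)
    devices = [OpticalChip(8) for _ in range(3)]

    # Read: C is illuminated and all devices receive its contents together.
    c.bits = [1, 0, 1, 1, 0, 0, 1, 0]
    c.illuminated = True
    broadcast(c, devices)

    # Write: one device changes its copy, illumination is switched from C
    # to that device, and its contents are copied back to C.
    devices[1].bits[0] = 0
    c.illuminated = False
    devices[1].illuminated = True
    broadcast(devices[1], [c])
    print(c.bits)
```

The number of receivers does not change the number of steps in this model, which is the property the text refers to as simultaneity over devices.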

3.  System Architectures

The optical interconnect presented here can be used in a variety of
configurations. We discuss two examples in this section.

Architecture I:

This will be the first system to be built. The system will consist of the
following (Figure 2):

(a)  A system bus S_0.

(b)  A central shared memory M_0, consisting of L-B-coated chips M_0j,
     j = 1, ..., m.

(c)  p processor-memory-bus modules. In the i-th of these, a processor P_i
     will be connected via a local bus S_i to "local memory" M_i, consisting
     of [...] memory chips M_ij, j = 1, ..., m. Note that in spite of the name
     "local memory," M_i serves primarily not as memory but rather as a
     receiver for the reflected beam from M_0. P_i reads M_0 by reading M_i,
     which contains an up-to-date copy of M_0 at all times, since the
     illumination of M_0 is maintained continuously.

Memory reads are handled optically, in the manner described above. Writes are
handled electronically, through the system bus. The system bus also indirectly
serves as a mechanism for dealing with the process synchronization problem:
Suppose P_i is writing to a shared variable in chip M_0j in M_0, and during
this time P_k wishes to read that variable. How do we temporarily suppress
access by P_k? In an ordinary (nonoptical) system, this is handled through
special bus operations which allow a read-modify-write cycle, but that would
not be available in a totally optical system.

However, since in this architecture we still do have a system bus S_0, the
problem can be solved by providing logic which will compare the address
portions of S_0 and S_k, and temporarily suppress the signal to the Chip
Select pin on M_kj if [...]. All of this would be transparent to both P_i and
P_k, though P_k might "see" a delayed memory access response.
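
The address-comparison logic can be sketched as follows. This is a minimal illustration under stated assumptions: comparison is done on whole addresses, the system bus is represented as a per-cycle trace of (write_active, write_address) pairs, and a suppressed read simply waits for the first cycle in which no conflicting write is in progress. The function names and the trace format are hypothetical.

```python
def chip_select_enabled(write_active, write_addr, read_addr):
    """Return False while a write on the system bus targets the address
    that the local processor is trying to read."""
    return not (write_active and write_addr == read_addr)


def read_completion_cycle(read_addr, bus_trace):
    """Return the first cycle at which the read may complete.

    bus_trace is a list of (write_active, write_addr) tuples, one per
    cycle. Returns None if the read is suppressed for the whole trace.
    """
    for cycle, (active, addr) in enumerate(bus_trace):
        if chip_select_enabled(active, addr, read_addr):
            return cycle
    return None


if __name__ == "__main__":
    # A writer holds address 5 for two cycles, then releases the bus.
    trace = [(True, 5), (True, 5), (False, 0)]
    print(read_completion_cycle(5, trace))  # reader of address 5 is delayed
    print(read_completion_cycle(7, trace))  # unrelated read is not delayed
```

The reading processor needs no knowledge of the writer; it only observes a longer access time, which matches the transparency described above.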

Note that we have said nothing here about logical versus physical structure of
the shared address space. For example, most existing TC systems use a
low-order interleaving scheme for assigning addresses to memory chips, the
idea behind such a scheme being to alleviate some of the problems of memory
contention. However, there would appear to be no particular advantage to this
form of addressing in the optical architectures described here.

Although this system is somewhat restricted, having
optical reads but only conventional writes, the loss may
not be very large. Typically reads dominate writes by a
ratio of 3:1 or more, with some applications (e.g.
database searches) having virtually 100% of memory
access being in the read mode.
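
As a worked illustration of the ratio quoted above (the symbols r and f are introduced here only for this example): if reads outnumber writes by r to 1, the fraction f of memory accesses that take the optical path is

```latex
f = \frac{r}{r + 1}, \qquad
r = 3 \;\Rightarrow\; f = \frac{3}{4} = 75\%, \qquad
f \to 1 \ \text{as} \ r \to \infty .
```

So at a ratio of 3:1, three quarters of all accesses are optical, and read-dominated workloads approach 100%.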

Architecture  II: 

This is in some sense the other extreme in the
spectrum of possible configurations. The architecture
is designed to most fully exploit the tremendous potential
for parallelism provided by the optical technology
described in Section 2; the price paid is in extra memory,
and possibly extra computing apparatus (i.e. additional
lenses and/or mirrors).

The desirable qualities for which we are aiming in this
architecture are

•  both reading and writing being done optically

•  avoidance of using electronic media for access to
shared variables and other synchronization problems

•  avoidance of delays due to writing, e.g. caused
by the need to change the direction of illumination

•  avoidance of the need to develop chip technologies
which allow both optical read and write access

To accomplish these goals, we have formulated a
configuration consisting of the following. There will be
p+1 processor/memory/bus modules (Figure 3), where the
i-th module consists of a processor $P_i$, memory units
(each consisting of many chips, but for simplicity referred
to here as a single chip) $M_{i,j,k}$, and bus $S_i$. Among the
memory units $M_{i,j,k}$ for Module i, the subscript j ranges
from 1 to p for i = 0, while for i > 0, j is equal to 0
only; the subscript k takes on the values 0 and 1 for any
value of i.

Here is how the system works. Modules 1 through p
do the computation, while the leader Module 0 manages
the operation of the system and serves as a
source/collector. More specifically:

(a)  Any memory chip in the lead module of the
labeling form $M_{0,j,0}$ (j > 0) is L-B-coated and
under constant illumination, and is constantly
read by the computational processor $P_j$ through
$P_j$'s local memory $M_{j,0,0}$, as in Architecture I. As
before, this is accomplished by virtue of the fact
that $M_{j,0,0}$ will always have an up-to-date copy of
the contents of $M_{0,j,0}$.

In this configuration, each computational processor
has read access to a different set of chips in
the lead memory. An alternative configuration,
which we do not discuss in detail here for the
sake of notational simplicity, would be to have
all read-chips in the lead memory readable by all
computational processors. This would economize
on memory, and would increase speed somewhat
in applications in which the same data is to be
broadcast to all computational processors.
Note, however, that in this configuration, there
still would be separate write-chips for each
computational processor, i.e. nothing in (b) below
would be changed.

(b)  The other lead memory chips, of the labeling
form $M_{0,i,1}$, are intended to be written to by the
computational processors $P_i$ (i > 0). We will
also describe this as reading the memories
$M_{i,0,1}$. This is accomplished by having the
corresponding chips in the computational
modules, of the labeling form $M_{i,0,1}$, L-B-coated
and under constant illumination, so that $M_{0,i,1}$
will constantly have an up-to-date copy of $M_{i,0,1}$.

(c)  The process synchronization problem will be
handled by message passing [Peterson & Silberschatz,
1985]. A computational processor $P_i$ can
access a message from the lead processor by
reading a variable in $M_{0,i,0}$; the lead processor
can obtain a message from a computational processor
by a similar read of some $M_{i,0,1}$. In effect,
there are no shared variables.

(d)  If a computational processor $P_i$ (i > 0) wants to
use $M_{i,0,0}$ for "scratch" work, it can close an
electronic shutter, temporarily shielding that
chip from the constant updating by $M_{0,i,0}$.
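
The memory layout of items (a) through (d) can be summarized by enumerating the chip labels. The sketch below assumes a three-subscript labeling (module i, unit j, role k), which is one reading of the partly illegible subscripts in the source; the function names are hypothetical. Under that reading, each chip and the chip that optically mirrors it differ only by swapping the first two subscripts.

```python
def chip_labels(p):
    """List the (i, j, k) labels of all memory units for p computational processors.

    Assumed convention: k = 0 marks lead-to-worker data, k = 1 marks
    worker-to-lead data.
    """
    labels = []
    for j in range(1, p + 1):       # lead module 0: one pair of units per worker
        for k in (0, 1):
            labels.append((0, j, k))
    for i in range(1, p + 1):       # computational module i: j = 0 only
        for k in (0, 1):
            labels.append((i, 0, k))
    return labels


def mirror(label):
    """Return the label of the unit at the other end of the optical link."""
    i, j, k = label
    return (j, i, k)


labels = chip_labels(3)
print(len(labels))           # 12 units in total for p = 3
print(mirror((0, 2, 0)))     # (2, 0, 0): worker 2's local copy of its lead chip
```

The count 4p makes the memory cost of this architecture explicit: every computational processor needs two units of its own plus two dedicated units in the lead module.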

The motivation for such a configuration can be seen
by the following two examples of the use of this system,
the first performing a sorting operation, and the second
one a Gaussian elimination procedure. Note that maximal
parallelism is obtained in both examples; there is no
contention for memory, for shared variables, or for
interconnection network resources, and all data transfers (e.g. transfer
of subarrays in Example 1) are fully parallel (e.g. a whole
subarray can be transferred in one memory cycle).

Example 1: Sorting

The lead processor will break up the array into subarrays
(e.g. as in Quicksort or Mergesort), send them to the
computational processors for sorting, and then combine them
to produce the sorted version of the original array.

The lead processor $P_0$ executes the following code:

   form p subarrays in chips of the labeling form $M_{0,i,0}$
   set a Ready variable in these chips
   for i = 1 to p do
   begin
      watch $M_{i,0,1}$ for $P_i$'s Done variable to be set
      read $M_{i,0,1}$ to get the sorted subarray
   end
   combine the sorted subarrays, yielding
      the sorted original array

$P_i$ executes the following code:

   watch $M_{0,i,0}$ for Ready variable to be set
   get subarray from $M_{0,i,0}$
   sort subarray
   set Done variable in $M_{i,0,1}$
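
The Ready/Done protocol above can be exercised in ordinary software. The following is a minimal sketch, not the authors' implementation: threads stand in for the processors, Python lists stand in for the lead-to-worker and worker-to-lead memory units, and `threading.Event` objects stand in for the Ready and Done variables. The name `parallel_sort` and the round-robin split of the array are assumptions made for illustration.

```python
import heapq
import threading


def parallel_sort(data, p):
    """Sort `data` with one lead thread and p worker threads.

    Ready and Done flags are the only synchronization, mirroring the
    message-passing scheme in the text.
    """
    lead_to_worker = [None] * p    # written by the lead, read by worker i
    worker_to_lead = [None] * p    # written by worker i, read by the lead
    ready = [threading.Event() for _ in range(p)]
    done = [threading.Event() for _ in range(p)]

    def worker(i):
        ready[i].wait()                    # watch for Ready variable to be set
        sub = list(lead_to_worker[i])      # get subarray
        sub.sort()                         # sort subarray
        worker_to_lead[i] = sub
        done[i].set()                      # set Done variable

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(p)]
    for t in threads:
        t.start()

    # Lead: form p subarrays, then set Ready for each worker.
    for i in range(p):
        lead_to_worker[i] = data[i::p]
        ready[i].set()

    # Lead: wait for each Done flag, then read the sorted subarray.
    runs = []
    for i in range(p):
        done[i].wait()
        runs.append(worker_to_lead[i])
    for t in threads:
        t.join()

    # Lead: combine the sorted subarrays into the sorted original array.
    return list(heapq.merge(*runs))


print(parallel_sort([5, 3, 9, 1, 4, 8, 2], 3))   # [1, 2, 3, 4, 5, 8, 9]
```

No lock is needed because each buffer has exactly one writer and one reader, which is the software analogue of the statement that there are no shared variables.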

Example 2: Gaussian Elimination

Each computational processor is assigned a group of
contiguous columns in the matrix. Below we give partial
code, showing an operation on a particular diagonal
element, say the d-th.

The lead processor $P_0$ executes:

   put the value of the divisor (reciprocal of diag. elt.)
      in a variable in the chips $M_{0,i,0}$
   set a Ready variable in these chips
   for i := 1 to p do
      watch $M_{i,0,1}$ for $P_i$'s Done variable to be set
   retrieve final matrix from the $M_{i,0,1}$

$P_i$ executes:

   watch $M_{0,i,0}$ for Ready variable to be set
   get divisor from $M_{0,i,0}$
   for all columns assigned to $P_i$ do
   begin
      divide by divisor, yielding w
      for all rows except d do
         subtract w from this (row, col) element
   end
   set Done variable in $M_{i,0,1}$
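
To make the column-wise division of labor concrete, here is a minimal sequential sketch of one pivot step. It is an illustration under stated assumptions rather than the authors' code: the function name is hypothetical, the workers are simulated by an outer loop, and the update subtracts w scaled by the row's entry in the pivot column (the standard Gauss-Jordan form), a scaling that the partial code above leaves implicit. It also assumes every worker can read the pivot column before any column is modified.

```python
def eliminate_pivot(a, d, p):
    """Apply one Gauss-Jordan step on pivot (d, d), columns split among p workers.

    `a` is a list of rows (modified in place and returned).
    """
    n_rows, n_cols = len(a), len(a[0])
    divisor = 1.0 / a[d][d]                        # lead: reciprocal of the diagonal element
    pivot_col = [a[r][d] for r in range(n_rows)]   # multipliers, saved before any update
    size = -(-n_cols // p)                         # ceiling division: columns per worker
    for worker in range(p):                        # each pass is one worker's independent job
        first = worker * size
        last = min(first + size, n_cols)
        for c in range(first, last):
            w = a[d][c] * divisor                  # divide by divisor, yielding w
            for r in range(n_rows):
                if r != d:
                    a[r][c] -= pivot_col[r] * w
            a[d][c] = w
    return a


print(eliminate_pivot([[2.0, 1.0], [4.0, 3.0]], 0, 2))   # [[1.0, 0.5], [0.0, 1.0]]
```

Because each worker touches only its own columns, the passes of the outer loop do not interact and could run concurrently.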
Other Architectures:


The above two architectures are just two examples;
many other configurations are possible. For example,
purely LC systems can be formed, e.g. as ring networks.
But again, instead of the serial interprocessor communication
available in ordinary ring networks, the optical channels
introduced here would provide exceedingly highly
parallel communication.

Another possibility is to set up memory hierarchies.
In this setting, motivated by a desire to conserve on
optical apparatus, only some memory access would be
optical, with the optically accessed memories serving in
the role of cache front ends for much larger memories.

4.  Some  Implementation  Details 

Implementation of OPTIMUL requires materials and
components for the illumination, electro-optic conversion
of data, and the subsequent conversion of the data back
to an electrical signal. The illumination for the system is
provided by a laser at a suitable wavelength and with
suitable optical power, as determined by the other
components/materials in the system. Appropriate optics
would be used to focus the beam on the processor imaging
arrays. The simplest implementation for OPTIMUL
involves coating the communication arrays with either
advanced, ultrafast, ferroelectric liquid crystals [Johnson
et al., 1987], or organic thin solid films. Liquid crystal
coatings have been applied to integrated circuits to
create optical diagnostics that overcome external
pin limitations in testing [Burns, 1979]. Birefringence
induced by the circuit voltages creates an intensity display
in the liquid crystal coating in much the same manner as
in digital watches. The nanosecond response of this
material translates into data rates of 10^6 / 10^[illegible] s,
approximately 10^[illegible] b/s, for each processor.

Nonlinear organic thin solid film materials are
currently of great interest for use in integrated optics
[Mourou and Meyer, 1984], and studies of the fundamental
characteristics of such materials [Garito and Singer,
1981] indicate that they possess nonlinear optical figures of
merit which are many orders of magnitude greater than
inorganic materials such as LiNbO3 and LiTaO3, as well
as offering response times approaching femtoseconds. The
most likely mechanism to be employed in these films is
the linear electrooptic effect, whereby a change in
polarization proportional to the chip voltage would be induced.
A polarizer inserted above the chip would produce an
intensity map of the array gates from the electrically
induced birefringence in the film. We are working to
deposit such films using spin-coating, as well as by the
Langmuir-Blodgett technique. The Langmuir-Blodgett
method is based on the sequential extraction of molecular
monolayers from a liquid surface onto a substrate [Kowel
et al., 1987].


Another approach would be to use a material with an
electrochromic property, which would produce an intensity
map of the electrodes directly through electrically
induced absorption of light. The Stark effect has been
used to characterize L-B films [Blinov et al., 1984] and
would be an alternative technique which would not
require polarizers above the film. Even though such an
interaction is likely to be slower than the electro-optic
effect, it may be a feasible implementation since so large
an amount of data is transferred simultaneously.

The deposition of these films should provide excellent
topographic coverage, be physically and chemically
robust, and be of very uniform thickness and optical quality.

A large number of processors can be accommodated
by introducing "fly's-eye" optics capable of imaging the
shared memory contents onto a large number of processors,
as depicted in Figure 4. CCD arrays are used as
receiver/transmitters in that figure, but as mentioned
before, DRAM or other technology is possible.

This configuration also allows for broadcast of a system
clock from the shared memory, so that all processors
can run in a synchronous mode if desired, although they
may generate multiple phases or frequencies from the
master clock for internal use. The clock could consist of
the controller for the Q-switched lasers which serially
illuminate the various arrays for their writing opportunities.


Performance Analysis of the
OPTIMUL  Multiprocessor  Interconnect 

N.  Matloff,  T.  Schubert,  S.  Kowel,  C.  Eldering,  M.  Loving 
Department  of  Electrical  Engineering  &  Computer  Science 
University  of  California  at  Davis 


1.  Introduction 

Multiprocessor (MP) systems, consisting of p interconnected but independent processors,
have the potential for a speedup factor of p in computational power. However, a
long-standing problem has been that this potential has not been realizable, due to the
overhead of processor-memory and/or processor-processor communication. This has been
the case for both types of MP systems which are usually considered:


Tightly-Coupled (TC) Systems:

The very significant overhead in TC systems takes the form of contention for the
shared central memory and for the processor-memory interconnect. The latter
problem is exacerbated by the fact that the expense of full crossbar switches results in
the use of other networks for which there is even more processor contention for the
interconnect, e.g. Ω-nets [Hwang and Briggs, 1984]. For small values of p, use of caches can
be effective, but the efficiency decreases with p [Wilson, 1987]. Furthermore, it has
recently been discovered that access to interprocess synchronization variables in shared
memory worsens this problem tremendously [Pfister and Norton, 1986].


Loosely-Coupled (LC) Systems:

In LC systems, there is no shared memory, but there is still communications overhead
of another kind. The processors communicate with each other through a network.
Bandwidth limitations on this interconnection network present very substantial
overhead. For example, intercluster accesses in the Cm* machine were 8.7 times
slower than accesses to local memory [Hwang and Briggs, 1984].


In both the TC and LC settings, another significant problem is the severe restrictions
resulting from chip pin limitations. Even channels of very high bandwidth, such as those
constructed from optical fibers, would not solve the problem arising from the fact that
there are only a few data pins but thousands or even millions of bits in a memory chip.

The entire history of the development of MP technology has been dominated by the
search for solutions to these problems [Siewiorek et al., 1982; Hwang and Briggs, 1984;
Agrawal, 1986]. Essentially, no completely satisfactory solutions have been found. For
example, after Cray Research, Inc. recently released the Cray X-MP, an MP version of the
Cray-1 supercomputer, a number of investigations [Bailey, 1987; Cheung and Smith,
1988; Oed and Lange, 1986] quickly showed the system to suffer from slowdowns due to
both contention for shared memory and contention for the network which connects the
processors to that memory, just as with all the earlier MP systems.

Perhaps an even more dramatic example is the S-1, a TC MP system developed at
Lawrence Livermore National Laboratory [Hwang and Briggs, 1984]. Throughout the
period of development of this system, it was hailed as one of the most advanced MP
projects in existence. However, recently the project was discontinued, in spite of all the
favorable publicity, and the very extensive funds expended [Bruner, 1987]. One of the
primary reasons given for the discontinuation was that the project engineers had found
that the contention for shared memory in the system would be much greater than they
had anticipated. They are now beginning work on a completely new design.

Such problems have been considered extremely difficult to solve, with some authors
even going so far as to say that we possibly should resign ourselves to the problems not
being solved, concentrating on software methods instead [Ledbetter, 1988].

However, in [Matloff, Kowel and Eldering, 1988] a radically new interconnect method
was presented which will solve these problems, and have other advantages as well. The
new interconnect will be usable for both fine-grained and coarse-grained types of
applications; it will solve the long-standing problems of contention for memory and for the
processor/memory switch in TC systems; in LC systems, it will enable a truly dramatic
improvement in interprocessor communications bandwidth, and again totally eliminate
contention for the interconnect switch; our approach should also be superior to other
optical processor/memory interconnects which have been proposed, e.g. optical crossbars
[Bell, 1986; Hutcheson et al., 1987]; and the pin-limitation problem is also eliminated, which is
a problem even in those architectures which have been proposed based on an optical fiber
interconnect.

Our name for this new interconnect is OPTIMUL, an acronym for Optical Multiprocessor
Interconnect. The central feature is an optical processor-memory channel, which
will allow simultaneous access of a memory chip, where the word "simultaneous" is
meant both with respect to all bits in the chip, and with respect to all processors. In
other words, all processors can simultaneously read the entire contents of a chip, with no
interference at all. Write access is of course restricted to a single processor at a time, but
it still is simultaneous across bits in the chip, i.e. an entire chip can be written in one
access. This optical channel is described in Section 2, and then MP system architectures
utilizing it will be proposed in Section 3. Sections 4 and 5 will present some performance
analyses of these architectures.

2.  A New Optical Memory Access Channel

Consider devices $D_1, D_2, \ldots$ which wish to read a memory chip C, in which are stored
bits $B_1, B_2, \ldots$. We will report here a technique in which the devices can read from C
optically, bypassing the need for using the chip's pins, and which will allow this access to
be simultaneous, with respect to both devices $D_i$ and bits $B_j$ (Figure 1).

To  achieve  this,  C  will  be  illuminated  and  mechanisms  used  to  cause  the  reflected 
light  beam  to  be  intensity  modulated  by  the  electric  fields  at  each  position  in  C.  Thus 
the  reflected  beam  will  contain  a  complete  bit  map  of  the  contents  of  C.  The  beam  will 
be demodulated by optical apparatus for focusing onto the receivers $D_i$.

Some preliminary implementation details were given in [Matloff et al., 1988]. An
update is given in the following:

To  achieve  the  desired  modulation  effect,  we  are  pursuing  two  strategies,  one  based 
on  advanced  ferroelectric  liquid  crystals,  and  the  other  using  thin  solid  film  structures 
containing  highly  nonlinear  dyes.  Either  material  would  be  used  to  coat  over  the  surface 
of  the  chip  C  above  (or  a  group  of  such  chips). 

The fields on the surface of typical ICs are of magnitude on the order of volts/µm,
larger than the fields supplied by the electrodes in a typical liquid crystal display. This
fact led to the demonstration of an electro-optical method for testing integrated circuits
[Burns, 1979]. Problems such as long switching times (≈ 10 ms) have recently been
resolved, with switching times on the order of 100 ns, and even faster operation appears
to be possible [Johnson et al., 1987].

We also have been examining the feasibility of using thin solid organic films as the
coating material to be used to effect the light modulation. Such materials appear
promising, and would offer a tradeoff of higher speed for lower image contrast [Kowel et al.,
1987; Kowel, 1985]. We are investigating synthesis and deposition techniques, and are
collecting electro-optical measurements to evaluate the potential of these films.

Demodulation  of  the  beam  back  to  storage  as  electric  fields  at  the  receivers  is  to  be 
accomplished  by  the  use  of  photosensitive  technology.  For  example,  one  possibility  is  to 
use  ordinary  DRAM  memories,  which  have  a  natural  sensitivity  to  light.  This  means 
also  that  parts  of  C  must  be  masked  from  the  light,  so  that  illumination  of  C  does  not 
change  the  contents  of  bits  in  C;  e.g.  only  the  output  portion  of  a  gate  can  be  exposed. 

However, in commercially produced chips, this photosensitivity of DRAMs may not
be uniform enough for reliable use as demodulators, since the sensitivity is a byproduct,
not a primary specification. Thus, we are taking other approaches instead, based on
photodiodes. We have designed and simulated such a receiving device [Loving and
Eldering, 1988]. In fact, other such memories have been proposed [Kosnocky, 1971; Ullman
et al., (year illegible)].

In this way, the values stored at all the bits $B_j$ in C can be transmitted optically to
the devices $D_i$, simultaneously over all subscripts i and j. Clearly, the simultaneity over i
will have highly significant implications for the memory and interconnect contention
problems which have plagued TC systems, while the simultaneity over j will have an
equally profound impact on the network bandwidth limitation problem in LC systems.
Note that the classical bottleneck arising from limitations on the pins-to-stored-bits ratio
is completely bypassed in the approach described here. Writes to C can be accomplished
by reversing the process.
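
The two kinds of simultaneity can be contrasted with a pin-limited read in a few lines. This is a toy model only: the function names are hypothetical, and the chip size and pin count in the usage lines are illustrative values, not figures from the text.

```python
def optical_read(chip_bits, n_devices):
    """Model one optical read: every device receives the chip's full bit map at once."""
    return [list(chip_bits) for _ in range(n_devices)]


def pin_read_cycles(n_bits, n_pins):
    """Cycles a pin-limited chip needs to deliver n_bits to a single reader."""
    return -(-n_bits // n_pins)    # ceiling division


copies = optical_read([1, 0, 1, 1], 3)
print(copies)                      # three identical copies of [1, 0, 1, 1]
print(pin_read_cycles(1024, 8))    # 128 cycles for one reader through 8 pins
```

In the optical model the step count is independent of both the number of devices and the number of bits, which is the sense in which the pins-to-stored-bits bottleneck is bypassed.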


3.  System  Architectures 

The optical interconnect presented here can be used in a variety of configurations.
Two of these were described in [Matloff, Kowel and Eldering, 1988], which will be
summarized here:


Architecture  I: 

This configuration features optical memory reads, but uses electronic writes, the
latter being motivated by a desire for simplicity in the first prototype to be constructed,
and by the fact that the electronic bus, with a standard Test-and-Set cycle or similar
mechanism, avoids the interprocess synchronization problem which must be solved in a
purely optical system.


Architecture II:

This configuration features both optical reads and writes. It is intensive in memory
quantity needed, with essentially separate memory modules being used for reads and
writes. Interprocess synchronization is handled by message-passing techniques [Peterson
and Silberschatz, 1985], implementations of which were given in examples in [Matloff,
Kowel and Eldering, 1988].


Other  Architectures: 

The above two architectures are just two examples; many other configurations are
possible. For example, purely LC systems can be formed, e.g. as tree or ring networks
(see Section 5). But again, instead of the serial interprocessor communication available
in ordinary ring networks, the optical channels introduced here would provide
exceedingly highly parallel communication. This is currently being investigated [Matloff and
Schubert, 1988].

Another  possibility  is  to  set  up  memory  hierarchies.  In  this  setting,  motivated  by  a 
desire  to  conserve  on  optical  apparatus,  only  some  memory  access  would  be  optical,  with 
the  optically  accessed  memories  serving  in  the  role  of  cache  front  ends  for  much  larger 
memories. 


4.  Performance Analysis: Simulation of a Continuum of Systems with Varying
Degrees of Coupling

Numerous mathematical analyses of multiple access of memory systems have been
presented (a nice collection of references appears in the introduction to Chapter 6 of
[Agrawal, 1986]). However, for the present purpose, a simulation analysis was preferred,
in the interests of (a) simplicity, and (b) modeling the ability of an OPTIMUL processor to
do a parallel access of a large data structure.

Specifically, we set up the following model, which can be considered as an abstraction
which is representative of a number of architectures which could be developed using the
optical interconnect introduced in [Matloff, Kowel and Eldering, 1988]. We will refer to
the abstracted system here by the same name, OPTIMUL.

In this system we have p processors viewing a central shared memory of m modules.
Consider the operation of one processor P. P will alternate between periods of memory
access and nonaccess. We assume the nonaccess time (measured in units of memory
cycles) has a geometric distribution with a specified mean.

The model then assumes that at the start of an access period, P will send to a
memory controller a request for $R$ consecutive words in the memory space, e.g. a
request to read an entire array or subarray. $R$ is assumed to have a geometric
distribution with mean $\mu$.

We are comparing OPTIMUL to a conventional MP system. There is extremely wide
variation in “conventional” MP systems; the model cannot incorporate all of them.
Instead the model has been designed so that variation of its parameters will allow
modeling of a range of situations suitable for comparison to OPTIMUL; this will be seen below.

In the conventional system, it is assumed that the memory controller will satisfy the
requests made by P in whatever order they become satisfiable, similar to the “C”
organization [Hwang and Briggs, 1984; Kogge, 1981], with consecutive words stored in
consecutive modules (mod m), i.e. using low-order interleaving. If one of the words
requested by P encounters contention with a request from another processor, one of the
processors must wait. If a requested module is free, it takes one unit of time to satisfy a
request for one word of memory.
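
The low-order interleaving and contention rule of the conventional model can be sketched as follows. This is a minimal illustration, not the authors' simulator: the function names are hypothetical, and the rule that the lowest-numbered processor wins a conflict is an assumption, since the text says only that one of the processors must wait.

```python
def module_of(address, m):
    """Low-order interleaving: consecutive words fall in consecutive modules (mod m)."""
    return address % m


def modules_for_request(start, length, m):
    """Modules visited, in order, by a request for `length` consecutive words."""
    return [module_of(start + k, m) for k in range(length)]


def arbitrate(wanted):
    """Grant at most one processor per module in a cycle.

    `wanted` maps processor id -> module wanted this cycle. Ties go to the
    lowest-numbered processor (an assumed policy).
    """
    granted, busy = set(), set()
    for proc in sorted(wanted):
        if wanted[proc] not in busy:
            busy.add(wanted[proc])
            granted.add(proc)
    return granted


print(modules_for_request(14, 4, 16))     # [14, 15, 0, 1]
print(arbitrate({0: 3, 1: 3, 2: 5}))      # {0, 2}: processor 1 must wait
```

A request for R words therefore needs at least R cycles in the conventional system, and more whenever another processor holds a module it needs.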

On the other hand, in modeling OPTIMUL, we are assuming that any request takes
only one unit of time to service, for any value of $R$, i.e. OPTIMUL will access all
$R$ words in one time unit, due to OPTIMUL's ability to transfer the entire contents of a
memory chip in parallel. (This ability would be most fully exploited if the
$M_{i,j}$'s (as in Architecture I) were contained within the processors; we are making such an
assumption here.) At the same time, in some ways our model is too conservative, i.e. it
actually underestimates OPTIMUL's potential; this will be explained below.

The simulation actually measures the performance of our model conventional MP
system, rather than OPTIMUL itself. The mean delay per memory access, $D_c$, is found
for the conventional system. Under the model described here, the corresponding mean
delay for OPTIMUL is exactly 1.0. Thus $D_c$ may be used as a figure of merit for
OPTIMUL, i.e. a measure of the speedup in memory access obtained.
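
A simulation in the spirit of this model might look like the sketch below. It is a hedged illustration only: the function names, the random per-cycle priority among contending processors, the random starting module, and the cycle-by-cycle bookkeeping are all assumptions, and the sketch is not expected to reproduce the figures reported in this paper. It returns an estimate of the conventional system's mean delay per request, to be compared with the delay of 1.0 assumed for OPTIMUL.

```python
import random


def geometric(mean, rng):
    """Sample from a geometric distribution on {1, 2, ...} with the given mean (>= 1)."""
    success = 1.0 / mean
    k = 1
    while rng.random() >= success:
        k += 1
    return k


def simulate_conventional(p, m, mean_nonaccess, mean_request, cycles, seed=0):
    """Estimate the mean number of cycles a conventional system needs per request."""
    rng = random.Random(seed)
    idle = [geometric(mean_nonaccess, rng) for _ in range(p)]  # cycles until next request
    addr = [0] * p        # next word wanted by each processor
    left = [0] * p        # words still to fetch in the current request
    started = [0] * p     # cycle in which the current request was first eligible
    total_delay = 0
    completed = 0
    for t in range(cycles):
        busy = set()                      # modules already granted this cycle
        order = list(range(p))
        rng.shuffle(order)                # assumed policy: random priority each cycle
        for i in order:
            if left[i] == 0:              # processor is in a nonaccess period
                idle[i] -= 1
                if idle[i] == 0:          # issue a request, eligible from next cycle
                    left[i] = geometric(mean_request, rng)
                    addr[i] = rng.randrange(m)
                    started[i] = t + 1
                continue
            module = addr[i] % m          # low-order interleaving
            if module in busy:
                continue                  # contention: wait one cycle
            busy.add(module)
            addr[i] += 1
            left[i] -= 1
            if left[i] == 0:              # request complete
                total_delay += t + 1 - started[i]
                completed += 1
                idle[i] = geometric(mean_nonaccess, rng)
    return total_delay / completed if completed else float("nan")


# One processor, one-word requests: no contention, so every request takes one cycle.
print(simulate_conventional(p=1, m=16, mean_nonaccess=2.0, mean_request=1.0, cycles=1000))  # 1.0
```

With a single processor the delay of a request equals its length, so any excess over the mean request size in multi-processor runs is attributable to memory contention alone; like the model in the text, the sketch includes no interconnect queueing delay.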

We  have  noted  that  one  of  OPTIMUL’s  important  advantages  is  that  it  can  operate 
in  both  TC  and  LC  modes.  This  is  the  motivation  behind  our  model  for  the  memory 
access of a conventional MP:

TC Model:

We model the “typical” TC system as having a fairly small mean nonaccess
time, 2.0. This reflects the fact that TC systems are appropriate for applications
in which the processors must communicate with each other fairly often, and that they
do so by accessing the shared memory. However, such accesses are usually for only one
word, or a small number of words. To reflect this, we set the mean request size $\mu$
to be fairly small in our simulator.

To model TC systems which exist today, in which the number of processors is limited,
we will set the number of processors p to be small in our TC simulations,
specifically 16.

LC  Model: 

Here we set p to be a larger number (64) in our simulator, reflecting the situation
in many current LC systems (of course, many such systems are even larger
than this). Also, since LC systems are set up for applications in which the processors
communicate with each other less frequently, we have set the mean nonaccess
time to be fairly large (100.0). However, when LC systems do communicate with each
other, it tends to be with relatively large amounts of data; thus we have set the mean
request size $\mu$ to be fairly large in our simulation, with a value of 100.0.

Both the TC and LC models include m = 16 memory modules in the shared memory.

Note that these models will severely underestimate OPTIMUL's potential, in a
number of ways. For example, the TC model tacitly assumes that the processor/memory
interconnect switch for the conventional MP machine is in the form of a crossbar, which
is not typical in MP systems, and is actually infeasible for the larger ones. Thus the
model for the conventional MP machine does not incorporate any queueing delay due to
the interconnect switch; as mentioned above, such delay can be quite large, and thus this
results in underestimating OPTIMUL's potential. Of course this built-in bias against
OPTIMUL will be even worse in our LC model, since the interconnect queueing delay is
much worse in that case; we are not allowing for network traffic delay at all in this
simple analysis.

A  large  number  of  simulation  runs  were  conducted,  but  instead  of  reporting  all  of 
them,  we  will  concentrate  on  three  representative  examples: 

Example A:

This is a TC model, with mean request size $\mu$ = 1.0. This setting can be expected to give only
a modest advantage to OPTIMUL over conventional machines, due to the above-mentioned
lack of interconnect queueing delay in our model. However, we still found
that the figure of merit $D_c$ was 1.34, i.e. even with this setting's bias against OPTIMUL,
OPTIMUL has a 34% advantage.

Example  B: 

This too is a TC model, but with mean request size $\mu$ = 10.0, representing a situation in
which the $P_i$ are vector processors. This models a setting in which most memory
accesses of a processor are for scalars, but occasionally a vector access is made.
Here we found that $D_c$ = 12.44, a 12-fold advantage for OPTIMUL.

Example  C: 

This is the LC model described above. Here OPTIMUL has a very dramatic
advantage over a conventional system, with $D_c$ = 277.35 (and, as mentioned above,
this number is probably an underestimate of the true value).
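
The conversion from the figure of merit to the quoted advantages is simple arithmetic, sketched here for the three reported values. The function name is hypothetical; the only inputs are the figures of merit stated above and the OPTIMUL delay of 1.0 assumed by the model.

```python
def advantage_percent(d_c, d_optimul=1.0):
    """Percentage by which the conventional delay d_c exceeds the OPTIMUL delay."""
    return (d_c / d_optimul - 1.0) * 100.0


# Figures of merit reported for Examples A, B and C.
for name, d_c in [("A", 1.34), ("B", 12.44), ("C", 277.35)]:
    print(name, f"{d_c:.2f}x slower, {advantage_percent(d_c):.0f}% advantage")
```

For Example A this gives the 34% figure quoted in the text; for Examples B and C the ratio itself (roughly 12-fold and 277-fold) is the more natural way to state the result.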

In addition, one of OPTIMUL's most significant advantages is invisible in the
simulation study, namely the feasibility of using a much larger number p of processors in a TC
system. The limitations of crossbars (or their more sophisticated variations) on p imply
that it would be infeasible to use TC systems in applications having a very high degree
of inherent parallelism. The optical interconnect nature of OPTIMUL should make it
much more feasible to build large TC systems, so that more highly parallel applications
may be handled.


5. Performance Analysis: Case Study of a Sorting Application

In Section 4, we presented an analysis based on abstraction of memory access patterns
in multiprocessor systems. This analysis showed the potential of OPTIMUL to be
quite  dramatic  for  some  settings  of  the  simulation  parameters.  However,  additional 
understanding  is  gained  by  investigating  the  performance  of  OPTIMUL  on  a  specific 
application,  which  is  done  in  this  section.  The  analysis  here  is  basically  a  trace-driven 
simulation  of  the  performance  of  our  proposed  system  on  sorting  problems. 

The analysis assumes that OPTIMUL’s processors are of speed comparable to that of
a  VAX  8600.  Single-processor  computation  times  used  below  were  obtained  by  using  the 
Unix  ‘time’  command  to  get  processor  run  times  for  actual  C  code  for  the  sort  algorithm 
specified  below. 

The processors are assumed to be set up as an LC system in a ring topology. The
OPTIMUL version of this system is assumed to have optical neighbor-to-neighbor links
which use the technology described above, with the capability of transferring millions of
bits in hundreds of nanoseconds; interprocessor communication time is essentially
negligible in this system. The non-OPTIMUL version of the system has “conventional”
neighbor-to-neighbor links having transmission rates of 50 megabits/second. Links of
this speed or better are beginning to appear, e.g. in the “semi-LC” VAX Cluster systems
(Kronenberg, Levy and Strecker, 1988); this rate is much faster than is typical among
most LC systems to date, e.g. the Hypercube.


The sort algorithm used was Quickmerge (Quinn, 1987), which consists of three
phases. During Phase I each processor sorts a subset of the array using Quicksort.
These subsets must then be merged to complete the sort. Before the merge phase
(Phase III), a search phase (Phase II) is added so that the merge task can be divided
among all the processors. Processors search for dividers to partition each of the sorted
subsets such that there is no value in partition i of any subset j which is greater than
any value in partition i+1 of any subset k. During the merge phase, each processor joins
together a set of partitions which share common dividers. Because of the divisions
performed in step two, merged partition i precedes merged partition i+1.
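
The three phases can be illustrated with a short sequential sketch in Python. This is
only an illustration under stated assumptions, not the code that was timed for this
report: the function name quickmerge is hypothetical, the built-in sorted() stands in for
each processor's Quicksort, and the dividers are taken as evenly spaced elements of the
first sorted subset, a choice the text above does not specify.

```python
import bisect
import heapq


def quickmerge(data, p):
    """Sequentially simulate the three Quickmerge phases for p processors."""
    n = len(data)

    # Phase I: each simulated processor sorts its own block of roughly n/p items.
    # sorted() stands in for the per-processor Quicksort.
    bounds = [n * r // p for r in range(p + 1)]
    subsets = [sorted(data[bounds[r]:bounds[r + 1]]) for r in range(p)]

    # Phase II: pick p-1 dividers (assumption: evenly spaced items of subset 0)
    # and locate each divider in every sorted subset by binary search.
    first = subsets[0]
    dividers = [first[len(first) * i // p] for i in range(1, p)] if first else []
    cuts = []
    for s in subsets:
        cuts.append([0] + [bisect.bisect_right(s, d) for d in dividers] + [len(s)])

    # Phase III: simulated processor i merges partition i of every subset.
    merged = []
    for i in range(len(dividers) + 1):
        parts = [s[c[i]:c[i + 1]] for s, c in zip(subsets, cuts)]
        merged.append(list(heapq.merge(*parts)))

    # The lead processor concatenates the merged partitions in order.
    return [x for part in merged for x in part]


if __name__ == "__main__":
    sample = [(i * 37) % 101 for i in range(40)]
    print(quickmerge(sample, 4) == sorted(sample))
```

Because every value in partition i is no larger than divider i, and every value in
partition i+1 is larger than it, the merged partitions can simply be concatenated. With
dividers drawn from a single subset the partitions may be unevenly sized, so the load
balance of the merge phase in this sketch should not be taken as representative.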

On  an  LC  system,  the  communication  between  phases  is  substantial: 


(a)  Before  the  initial  (sort)  phase,  each  processor  must  receive  a  subset  of  the  array 
to  sort.  These  subsets  are  sent  by  the  lead  processor,  relayed  from  processor  to 
processor  along  the  ring  until  reaching  the  desired  destination  processor. 

(b)  Before  the  second  (search)  phase  begins,  each  processor  must  receive  the  sort 
phase  results  from  all  other  processors. 

(c)  Before  the  final  (merge)  phase,  the  partition  dividers  must  be  passed  to  each  of 
the  processors  (note  that  the  data  is  already  present  in  each  processor's  private 
memory). 

(d) Finally, the merged partitions must be returned to the lead processor for
concatenation.

The  entire  array  must  be  broadcast  three  times.  As  the  total  communications  cost  is 
dominated  by  this  data  movement,  we  won’t  consider  the  transfer  time  of  the  partition 
dividers. 

Within an OPTIMUL ring configuration, memory would appear to be shared since
information could be transferred continuously around the ring. As OPTIMUL allows a
complete memory-memory transfer in one memory cycle, data can be transferred
(broadcast) to all processors in p memory cycles where p is the number of processors on
the ring. Preliminary study suggests that we will be able to transfer the contents of one
memory chip to another in less than 500 ns, and that this time can be reduced to less
than 100 ns. Even given the slower speed, data could be broadcast to all members of a
64-member ring in about 32 µs (63 × 500 ns = 31,500 ns << 1 ms). This is a substantial
savings over the alternatives discussed above, even ignoring the propagation delay around
the ring.
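
The arithmetic behind this estimate can be checked with a few lines of Python. The
sketch below is illustrative only. The function names are hypothetical, and the
conventional-link figure assumes a workload of 256k (262,144) integers of 32 bits each
sent over a single 50 megabit/second hop; the word size is an assumption, not a figure
from this report.

```python
def optical_broadcast_ns(members, cycle_ns):
    """Time to pass one memory image to every other ring member, one hop per cycle."""
    return (members - 1) * cycle_ns


def link_transfer_s(n_items, bits_per_item, rate_bits_per_s):
    """Time to push n_items over one conventional neighbor-to-neighbor link."""
    return n_items * bits_per_item / rate_bits_per_s


slow = optical_broadcast_ns(64, 500)   # 63 hops at 500 ns each
fast = optical_broadcast_ns(64, 100)   # same ring at the projected 100 ns cycle
# Assumed workload: 256k integers of 32 bits each, one hop at 50 Mbit/s.
one_hop = link_transfer_s(256 * 1024, 32, 50e6)

print(f"optical broadcast, 500 ns cycle: {slow / 1000:.1f} us")
print(f"optical broadcast, 100 ns cycle: {fast / 1000:.1f} us")
print(f"conventional link, single hop:   {one_hop * 1000:.1f} ms")
```

Under these assumptions a single conventional hop already takes on the order of a
hundred milliseconds, while the full optical broadcast stays in the tens of
microseconds.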

Below are tables indicating approximate times for the Quickmerge algorithm were the
algorithm executed on OPTIMUL and non-OPTIMUL rings as described above. The
improvements look modest in comparison with that of Example C in the last section, but
still are quite impressive, with speeds double and triple those of the conventional LC
system. The largest improvement reported occurs for a 128-processor system sorting a 256k
integer array. Here the OPTIMUL system would perform approximately three and a half
times faster than a non-OPTIMUL system having the same number of processors. For
larger problems and more processors, larger speedup factors might be observed. [On the
other hand, it appears that additional tuning of the algorithm could be done for the
non-OPTIMUL setting, and the gap in performance narrowed somewhat.]

More detailed analyses, including the implementation details for such a ring
configuration, are currently in progress (Matloff and Schubert, 1988).

The gains reported here are significant, but modest in comparison to the most
extreme gains presented in Section 4. In that light, it must once again be pointed out
that speedup factors are highly application-dependent. In particular, in the sorting
application analyzed here, there is a fundamental obstacle to speedup, in terms of the
relative size of computation and communication times:

Consider sorting n items on p processors, by partitioning into blocks of approximate
size n/p each. The computation time is approximately C (n/p) log(n/p) for some C,
assuming that all subproblems finish at roughly the same time; this is not a bad
assumption, since the standard deviation of sort times is small compared to the mean
(Gonnet, 1984). (For simplicity, we are ignoring the merge phase in the analysis below.)
The communications time is roughly D (n/p) p (n/p amount of data being passed through p
nodes) for some D.

Fix p and vary n. If the ratio n/p is too small, then very little data is being passed
from node to node, not enough to fully exploit the highly parallel data transmission
capability in OPTIMUL. On the other hand, as n grows, the computation time tends to
dominate the communication time. In this setting, OPTIMUL’s communications advantages
will be quite substantial over non-OPTIMUL systems, but the advantages will not
be important, since communications times will be a minor proportion of the total times
anyway.
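
A small numerical sketch in Python makes this trend concrete. It simply evaluates the
two cost models stated above; the constants C = D = 1 and the base-2 logarithm are
arbitrary assumptions chosen for illustration, and the function names are hypothetical.
The models are only meaningful for n/p > 1.

```python
import math


def comp_time(n, p, C=1.0):
    """Per-processor sort time model: C * (n/p) * log(n/p)."""
    block = n / p
    return C * block * math.log2(block)


def comm_time(n, p, D=1.0):
    """Communication time model: D * (n/p) * p, i.e. D * n."""
    return D * (n / p) * p


def comm_fraction(n, p, C=1.0, D=1.0):
    """Share of total time spent communicating under the two models."""
    t_comm = comm_time(n, p, D)
    return t_comm / (t_comm + comp_time(n, p, C))


# Fix p = 64 and let n grow: the communication share shrinks, but only slowly,
# because the ratio of the two models grows like log(n/p).
for exp in (12, 16, 20, 24):
    n = 2 ** exp
    print(f"n = 2^{exp}: communication share = {comm_fraction(n, 64):.2f}")
```

In this model the ratio of computation to communication time is (C / (D p)) log(n/p),
so a faster interconnect (a smaller D) raises that ratio further and leaves
communication an even smaller part of the total.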

In other words, applications such as sorting, having computation times which are
more than O(n), are poor candidates for studies whose aim is to investigate
interprocessor communications costs. In such applications, inefficient interprocessor
communication might not be penalized much. Searching applications, with O(n) or O(log n)
computational times, should much more fully exploit OPTIMUL’s huge communications
bandwidth capabilities, and are currently under investigation.
